2018-03-04

[Machine Learning Series] Feature Engineering

Introduction

there’re now hundreds or perhaps thousands of researchers who’ve spent years of their lives slowly and laboriously hand-engineering vision, audio or text features. While much of this feature-engineering work is extremely clever, one has to wonder if we can do better. Certainly this labor-intensive hand-engineering approach does not scale well to new problems; further, ideally we’d like to have algorithms that can automatically learn even better feature representations than the hand-engineered ones. – from UFLDL

Feature types

Discrete features

Continuous features

Bucketization turns a continuous column into a categorical column. This transformation lets you use continuous features in feature crosses, or learn cases where specific value ranges have particular importance.

Bucketization

import tensorflow as tf
age = tf.feature_column.numeric_column("age")  # assumed definition of `age`
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
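To make the transformation concrete, here is a minimal pure-Python sketch of what bucketizing does to a raw age. It is not TensorFlow's implementation; the helper names `bucketize` and `one_hot` are made up for illustration, and the half-open bucket convention (each boundary belongs to the bucket on its right) is an assumption chosen to mirror the documented behaviour of `bucketized_column`.

```python
from bisect import bisect_right

# Same boundaries as in the age_buckets line above.
BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    n boundaries give n + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_{n-1}, +inf).
    """
    return bisect_right(boundaries, value)


def one_hot(index, size):
    """Encode a bucket index as a one-hot list of length `size`."""
    return [1 if i == index else 0 for i in range(size)]


ages = [17, 18, 29, 30, 70]
bucket_ids = [bucketize(a, BOUNDARIES) for a in ages]
num_buckets = len(BOUNDARIES) + 1  # 11 buckets for 10 boundaries
encoded = [one_hot(b, num_buckets) for b in bucket_ids]
```

Once the age is a bucket index, a linear model can learn a separate weight per age range instead of a single weight for the raw number, and the bucket can take part in feature crosses like any other categorical value.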

Why bucketize?

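Categorical features

Words in text are categorical.

Pixels in an image are not, because their magnitudes are meaningful.

Categorical features are usually converted into sparse vectors (for example, the way FM handles user_id, or the one-hot representation in NLP).

For example, for ‘eye_color’, ‘brown’ is represented as [1, 0, 0], ‘blue’ as [0, 1, 0], and ‘green’ as [0, 0, 1].

These are called sparse vectors because the dimensionality is usually high and only one entry is non-zero.

Why not represent each category with a single number, such as 1, 2, 3? That would take up only one dimension.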
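One way to look at that question: integer codes give the categories an order and a distance that they do not actually have, while one-hot vectors keep every pair of categories equally far apart. The sketch below compares the two encodings for the eye_color example; `one_hot`, `integer_code` and `squared_distance` are hypothetical helpers written for this note, not library functions.

```python
# Vocabulary order follows the eye_color example above.
VOCAB = ["brown", "blue", "green"]


def one_hot(value, vocab):
    """One-hot encode `value`; raises ValueError for an unknown category."""
    index = vocab.index(value)
    return [1 if i == index else 0 for i in range(len(vocab))]


def integer_code(value, vocab):
    """Encode `value` as a single number 1, 2, 3, ..."""
    return vocab.index(value) + 1


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


brown, blue, green = (one_hot(c, VOCAB) for c in VOCAB)

# One-hot: every pair of distinct colors is equally far apart.
one_hot_distances = [
    squared_distance(brown, blue),
    squared_distance(blue, green),
    squared_distance(brown, green),
]

# Integer codes: green looks twice as far from brown as blue does.
int_gap_brown_blue = abs(integer_code("brown", VOCAB) - integer_code("blue", VOCAB))
int_gap_brown_green = abs(integer_code("brown", VOCAB) - integer_code("green", VOCAB))
```

A model that does arithmetic on its inputs (a linear model, for instance) would treat the integer gap as real information, which is usually not what we want for unordered categories.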

Typical processing methods

One-hot encoding
Hash bucketing

Challenges

Unknown categories
- categorical_column_with_hash_bucket() can be used for these
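The idea behind hash bucketing can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: MD5 stands in for whatever hash TensorFlow uses internally, and `hash_bucket` is a hypothetical helper, not part of any library.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map any string to a bucket id in [0, hash_bucket_size).

    MD5 is used here only because it is deterministic across runs;
    it is an assumption of this sketch, not TensorFlow's hash.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


BUCKETS = 1000

# Categories seen during training and one that was never seen before
# all get a valid bucket id, with no vocabulary lookup required.
seen_ids = [hash_bucket(u, BUCKETS) for u in ["user_1", "user_2", "user_3"]]
unseen_id = hash_bucket("user_never_seen", BUCKETS)
```

The price is collisions: two different categories may land in the same bucket, so the bucket count trades memory against how often unrelated values end up sharing a weight.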

Feature engineering

Base Feature Column

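Raw features often lack expressive power, so people usually hand-design additional features on top of them. The quality of this feature design is critical to the whole system: with well-designed features, even a simple classifier or other model plugged in downstream can achieve good results. This is why industry favors the approach.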

Feature Crosses

Crossed (combined) features apply only to sparse features, and the result is still a sparse feature.

sport_x_city = tf.feature_column.crossed_column(
    ["sport", "city"], hash_bucket_size=int(1e4))
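Roughly speaking, a cross treats the combination of values as one new categorical value and hashes it into a fixed number of buckets. The sketch below imitates that in plain Python; the `cross` helper, the `_X_` separator and the use of MD5 are all assumptions made for illustration and do not reproduce how `crossed_column` actually computes its hash.

```python
import hashlib


def cross(values, hash_bucket_size):
    """Hash a combination of categorical values into one bucket id.

    `values` is an ordered list such as [sport, city]; the joined
    string acts as the combined category.
    """
    key = "_X_".join(values)  # hypothetical separator
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


HASH_BUCKET_SIZE = int(1e4)

# Each (sport, city) pair becomes a single sparse feature id.
basketball_beijing = cross(["basketball", "Beijing"], HASH_BUCKET_SIZE)
football_beijing = cross(["football", "Beijing"], HASH_BUCKET_SIZE)
```

The output is again a single id out of `hash_bucket_size` possible values, which is why the crossed feature remains sparse.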


Challenges & drawbacks

Feature engineering

Feature selection

References

Wide & Deep Learning Tutorial | TensorFlow | code
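- To improve a shallow model (such as LR), one usually extracts a large number of hand-crafted features up front; this is the wide model.
- DNNs generally use an end-to-end model that learns the mapping from raw features to the label.
- Wide & deep combines the strengths of both.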